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Abstract 

The analysis of structure and dynamics of biological networks plays a central role in understanding the intrinsic complexity 
of biological systems. Biological networks have been considered a suitable formalism to extend evolutionary and 
comparative biology. In this paper we present GASOLINE, an algorithm for multiple local network alignment based on 
statistical iterative sampling in connection to a greedy strategy. GASOLINE overcomes the limits of current approaches by 
producing biologically significant alignments within a feasible running time, even for very large input instances. The 
method has been extensively tested on a database of real and synthetic biological networks. A comprehensive comparison 
with state-of-the art algorithms clearly shows that GASOLINE yields the best results in terms of both reliability of alignments 
and running time on real biological networks and results comparable in terms of quality of alignments on synthetic 
networks. GASOLINE has been developed in Java, and is available, along with all the computed alignments, at the following 
URL: http://ferrolab.dmi.unict.it/gasoline/gasoline.html. 
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Introduction 

The structure and the dynamic of biological networks arise from 
interactions among molecules within the cell. Biological functions 
are obtained by the collaborative action of a number of cellular 
constituents, such as proteins, DNA, RNA and other small 
molecules [1]. Consequently, reductionism focusing only on the 
study of individual molecules and their limited interactions has 
shown its inadequacy in providing a comprehensive picture of 
living cells. Recently, high-throughput techniques for measuring 
protein-protein interactions (PPIs) have been introduced. Two- 
hybrid screening [2] and coimmuno-precipitation followed by 
mass-spectrometry [3] allowed the systematic study of protein 
interactions on a global scale. Extensive mining of scientific 
literature produces a variety of known biological interactions [4,5]. 
Several public and commercial databases, such as BioGRID [6], 
DIP [7], STRING [8], MINT [9], Yeast Proteome Database 
(YPD) [10] and Pathway Commons [1 1] collect specific knowledge 
in this area. The rapidly growing number and size of biological 
networks raises an important question on how can we make use of 
this network data to infer novel biological insights. A retrospective 
view of the recent history of molecular biology research shows that 
most of the attention has been devoted to sequence analysis. This 
indeed represents a fundamental level of biological investigation 
and for a long time has been the basis of evolutionary studies [12]. 
Recently, it has been shown that a system oriented approach to the 



study of biological phenomena may be more appropriate. In 
analogy with multiple sequence alignment, in which relevant 
functional parts of a single sequence are highlighted, common 
patterns in biological networks of different species provide an 
effective means of identifying functional modules (e.g., signaling 
pathways or protein complexes) conserved through evolution. 
Several research groups have proposed techniques to systemati- 
cally analyze and compare biological networks. A typical network 
analysis includes (i) network querying [13-17], which is commonly 
used to find structural (possibly) approximated motifs and to 
estabhsh whether such motifs are functional (over-represented); (ii) 
global and local network alignment [18—23], to understand if some 
functional complexes are conserved through species to infer 
evolutionary relationships among networks. Local network align- 
ment approaches based on Hidden Markov Models have been 
proposed [24,25] . However, their action has been restricted to the 
identification of shared paths. 

In this paper, we present GASOLINE (Greedy And Stochastic 
algorithm for Optimal Local multiple alignment of Interaction 
NEtworks), a novel algorithm for protein networks alignment 
based on iterative sampling [26] in connection with a greedy 
strategy. GASOLINE is inspired by the work of [27] and 
implements a seed-and-extend approach to extract shared 
complexes among a set of protein-interaction networks. The 
algorithm starts by identifying a set of similar nodes, one from each 
network, by using a Gibbs Sampling strategy. Then, a similar 
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technique is applied to extend the alignment. This step is iterated 
until the local density of the aligned subgraphs, measured through 
a properly defined degree ratio, increases. The algorithm iterates the 
above steps producing a set of local networks alignments (where 
each local alignment consists of a set of similar, in terms of both 
sequence and structure similarity, subgraphs). At the end of the 
process, each set of local networks alignments is ranked according 
to an index called Index of Structural Conservation. We 
extensively tested GASOLINE on: (i) a set of 25 biological 
networks drawn from STRING [8]; (ii) a set of artificially 
generated networks using the NAPABench tool [28] . 

We compared our system with a selected list of state-of-the-art 
methods such as SMETANA [23], IsoRank-N [22] and Network- 
Blast-M [19]. The experimental results show that GASOLINE 
outperforms the compared systems yielding the best results in 
terms of both quality (i.e. precision and recall) and running time. 
The method is very general and can be apphed to any type of 
networks (e.g. social, web, molecular) by using appropriate label 
and graph similarities. 

Methods 

Given weighted biological networks, where weights are 
probabilities expressing the rehability of pairwise protein relations 
(i.e. protein interactions in the case of PPI), informally, the local 
alignment of biological networks aims at finding a set of N regions 
(in our modeling subgraphs having the same number of proteins), 
one from each network, that are conserved in their sequence and 
interaction pattern. 

Such a problem is related to subgraph isomorphism, which is 
known to be NP-complete, therefore heuristics are needed. 
GASOLINE is able to produce an approximate solution through 
a stochastic-greedy strategy consisting of two phases. 

During the first step called bootstrap phase, we look for 
orthologous proteins across the networks and build a set of seeds. 
The set of seeds initially consists of A'^ proteins, one from each 
network, and includes all the starting nodes of the suboptimal local 
network alignment we are searching for. 

The second step, called iterative phase, repeatedly either adds 
(extension step) or removes (removal step) nodes in the network 
alignment, trying to maximize the final alignment score. Each 
extension step adds, in each network, a single node to the 
corresponding seed. During the extension step the seeds grow up 
producing a set of subgraphs, one from each network. The 
extension process is regulated by a properly defined degree ratio 
measuring the average density of the aligned subgraphs with 
respect to their neighbors in the networks. The extension is 
performed until this degree ratio increases. 

Each removal step replaces from the current alignment the set 
of proteins (one from each network) providing the minimum score. 

The initial phase and each extension step are performed 
through an iterative sampling. Consequently, diflFerent iterations of 
the algorithm may produce diflFerent local alignments. GASO- 
LINE iterates the above steps producing a set of local networks 
alignments. Those local alignments are ranked according to an 
Index of Structural Conservation (ISC) score combining topology 
and sequence similarity. 

GASOLINE implements preprocessing and post-processing 
steps. During preprocessing, the search space for potential seeds 
is reduced. This is obtained by marking only proteins having 
orthologs in all aligning networks and with a significant interaction 
degree in each network. 

All marked nodes in each network G, ( 1 < / < A^) are added to a 
set called S,. These sets will be used in the initial phase and will be 



Input: 

G: set of W PPI networks 
Si, Si, Sn: set of nodes fronn networks 
Gi, G2, Gn marked as potential seeds 



Initial phase: find an optimal 
alignment of seeds 



Remove seeds from 
Si, S2, Sn 




False 



Iterative phase: iteratively remove and add nodes 
for a fixed number of steps 



Extension step: 

extend seeds until the 
degree ratio increases 




Removal step: remove the 
set of mapped nodes with 
minimum Goodness score 


> 


t 










Output: 

set of local alignments 
of W subgraphs, ranked 
according to ISC score 



Figure 1. General description of GASOLINE. 

doi:10.1371/journal.pone.0098750.g001 

updated at each iteration. Finally, during post-processing, the final 
set of local alignments returned by GASOLINE is filtered by 
removing highly overlapping complexes. Flowchart in Figure 1 
provides a general description of GASOLINE. 

The computational complexity analysis assumes that all 
networks have the same number n of nodes and each iteration 
of the algorithm returns an alignment of subgraphs having size W. 
The complexity is given by 0{n^NW). Assuming that N < <n the 
complexity can be rewritten as 0{n^W). The worst case applies 
when networks are dense and very similar implying that the 
average size W of aligned complexes is 0{n). In this case the 
algorithm has an execution time 0{r?). In the average case we can 
assume that the size of aligned complexes is still function of n in 
particular W = 0{\/n). Therefore the complexity will be 0{rP"^). 
The best case applies when W =0{\). In this case the complexity 
will be 0{n^) (please refer to File SI for the details on the 
complexity analysis). Next section contains a more detailed 
description of the algorithm. 
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Network A Network B 




Figure 2. A toy example of extension and removal phases of GASOLINE algorithm in a pairwise alignment instance, (a) The nodes of 
the two aligned subgraphs are highlighted in green, (b) In red are highlighted those adjacent nodes of current alignment which will be explored by 
the sampling algorithm during the next step of iterative phase, (c) At the end ot such iteration the alignment will be extended with a new node in 
each network, (d) Once the iterative phase completes, it gives as a result the aligned subgraphs highlighted in green. In (e) the removal step identifies 
the cyan nodes as those contributing less to the alignment score, (f) These will be replaced with those capable to increase the alignment score. 
doi:1 0.1 371 /journal.pone.0098750.g002 



The algorithm 

Let N be the number of aligning networks. Eacli iteration of 
GASOLINE starts by searching an alignment of nodes (one node 
from each network) which can be viewed as the seeds of a 
candidate local alignment. Candidate proteins for the initial 
alignment are drawn from the sets of marked nodes (Si ,S2,--;Sf/), 
buUt in the preprocessing step. Once this step has produced a set of 
orthologous proteins, the iterative phase begins. Through this 
phase the seeds set is extended producing a list of subgrahps one 
from each subgraphs. 

At the end of the iterative phase the aligned seeds are removed 
from the sets S,- (in order to guarantee termination) and the process 
starts again from the bootstrap phase with new seeds proteins 
chosen in Si,...,S/v. 

Bootstrap phase. The search for an initial set of seeds is 
performed by a Monte Carlo Markov Chain in connection with a 
Gibbs Sampling algorithm. The Gibbs sampling builds a chain, 
where each state represents a combination (i.e. alignment) of 
proteins, one from each network. First, a random initial state is 
selected. Then, the sampling method iteratively performs a 
transition from a state to another, by replacing a randomly 
chosen protein of the current alignment with a protein of the same 
network, according to a properly defined transition probability 
distribution. By iterating this sampling procedure a sufficient 
number of times, we eventually achieve a good alignment of seeds. 

The transition probability is defined on top of a Similarity Score. 
Given two proteins a and b, we define their similarity score S{a,b) 
as either their Bit Score or the inverse of their BLAST E-value 
[29]. Let A' = {A\,...,A'j^} be the alignment of proteins at the i-th 
iteration of Gibbs sampling and suppose we remove the node A\ 
from it. Let be a candidate protein replacing A'l^. The similarity 
score of p is defined as the product of all similarity scores between 
p and the proteins still belonging to the alignment: 



SlM{p)= Tl^^l ^^I^S{A'^,p). The transition probability in p is 
then computed by using such the similarity scores as follows: 

P(D\A' A' A' A' )- ^^^^P^ 

Finally, the alignment score is defined as the sum-of-pairs of 
similarity scores between the aligned proteins: 

N N 

Score Seed = J2 Yl 

j=lk=l 

At the end of the bootstrap phase the alignment of seeds 
maximizing the sum-of-pairs score over all the iterations of the 
Gibbs sampling is chosen. 

Extension of current seeds. Let SG = {SGi,SG2,---,SGn} 
be an alignment of subgraphs, one for each network and Adji 
the set of nodes adjacent to one or more nodes in SGj. The goal of 
each extension step is to fmd an alignment A = {A\,A2,...,Af^} of 

proteins where AieAdj), and extend each SGj with Aj and the 
edges connecting Aj with the remaining nodes in SGj. 

Figure 2 shows a demo with two aligning networks. In Figure 2 
(a) the current alignment SG = {SGi,SG2}, consisting of two 
subgraphs composed by three nodes, is highlighted in green, with 
dashed lines connecting aligned proteins. Figure 2 (b) highlights in 
red all the nodes in Adji and Adj2. In Figure 2 (c) the new 
alignment of subgraphs yielded after a single extension step is 
shown in green. 
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Na=4000 Na.3om 




Figure 3. Phylogenetic trees for the synthetic networlts generated using NAPAbench. (a) 2-way alignment, (b) 4-way alignment, (c) 8-way 
alignment. Below each leaf node, the number of nodes and the average number of edges across the CG, DMC and DMR families of the corresponding 
network are shown in parenthesis. 
doi:1 0.1 371 /journal.pone.0098750.g003 



Each extension step is performed ttirough an iterative .sampling 
similar to the one described above, where a state of the Markov 
chain represents an alignment of nodes, one for each set Adji. 

Again, the initial state of the chain is randomly selected. Then, a 
series of transitions from a state to another one is made, by 
replacing a randomly chosen protein of the current alignment with 
a node of the same network in the corresponding adjacent set. The 
transition probabilities are computed by considering sequence 
similarity in connection with neighborhood similarity. 

Let A' = {A\,A'2 ■ ■ ■ ,A'f^} be the alignment of proteins at the f-th 
iteration of Gibbs sampling. Suppose we remove protein A'l^ from A' 
and let be a protein of the same network candidate to replace it. 

The Similarity Score of p takes into account both the orthology 
relation and the topological similarity between p and proteins in 
A\{Ai}. 

The orthology score of p which takes into account the 
sequence similarity, is defined as in the bootstrap phase: 

siMo(p)=nJl,,,^,s(p,^;). 

Concerning the topology similarity, we build a vector V , called 
topology vector, storing the weights of the edges linking p to the nodes 
oiSGk. If there is no link between two proteins the weight is set to 0. 
Likewise, we buUd a topology vector for all the proteins in A'\{A'i^}. 

Given two proteins a and b, and their topology vectors and 
Vi,, the topology similarity score is the scalar product of the two 
vectors: TOP(a, 6) = {Va,Vh) . The topology similarity score of /i 

nN 
j^i^TOP{j>,A'i). The overall 

similarity Score of p can the be computed as: 
SIM(/7) = SIMo(/7) X SIM7'(/7). By normalizing in the range 
[0,1] we obtain the transition probability of p: 



P(p\A\,...,A\_„A[ 



SIM(p) 



SIM(«) 



The alignment score is calculated in terms of orthology and 
topology similarity by the sum-of-pair of the pairwise alignments: 

ScoreExtend(^) = 



j=l k=\ 



./=1 k=\ 



At the end of Gibbs sampling, the alignment with highest sum- 
of-pair score is selected for the extension of subgraphs in SG. 

The extension of subgraphs mainly depends on a degree ratio of 
the alignment which evaluates the local density and the modularity 
of aligned subgraphs with respect to their neighborhood. Given an 
aligned subgraph SGi, the degree ratio of SGi is the number of 
edges linking nodes within SGi over the sum of the degrees of 
nodes in SGi. Then, the degree ratio of a subgraph alignment SG 
is the average degree ratio of aligned subgraphs in SG. The 
extension process is repeated until the following two properties 
hold: (i) aU mapped proteins are in orthology relation (w.r.t. 
BLAST E-values or Bit Scores); (ii) the degree ratio of SG stricdy 
increases. 

Removal step. In the removal step, we discard from the 
current alignment a set of mapped proteins which give a minimal 
contribution to alignment quality. Such a step tries to refine the 
topology of the ahgned subgraphs and therefore does not take into 
account the sequence similarity. The reason behind this choice is 
that during the extension steps the subgraph topology conservation 
intrinsically decreases since no backtracking is performed. Such a 
step deals with such an issue by making use of a measure called 
Goodness score. 

Let SG be the current subgraph alignment and let W be the 
number of proteins in each aligned subgraph. We can represent 
SG as a A'^ X IF matrix, where each column contains mapped 
proteins across all the networks. The goal of this step is to delete 
the column minimizing Goodness. We define the Goodness of a 
generic protein SG[ij] of alignment SG as the ratio between the 
internal degree of SG[ij\, i.e. the number of links connecting 
SG[ij\ to the remaining nodes in the aligned subgraph, and its 
node degree. The Goodness of column j is the product of the 
Goodness scores of all its proteins: 



GooDNESs(/') = nGooDNESS(5G[;V'l) 



Each removal step deletes from the current alignment the nodes 
corresponding to the column with the minimum Goodness score. 
However, such proteins could be added again to the alignment, in 
some future extension steps. In Figure 2 (d)-(e) we report a toy 
example of the removal step. Figure 2 (d) consists of the current 
local alingment identified through the iterative step; In (e) the 
removal step identifies two cyan nodes as those giving a minimal 
contribution to the alignment score; In (f) the nodes are replaced 
with to different nodes increasing the alignment score. 
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Figure 4. The Specificity (SPE) and Number of correct nodes (CN). SPE (a) and CN (b) for various level of bias between the similarity score 
distribution for orthologs and the similarity score distribution for non-orthologs. 
doi:1 0.1 371 /journal.pone.0098750.g004 



Notice that, the topology similarity score between proteins in 
the Gibbs sampling algorithm of the extension process is defined in 
order to reward structural conservation, edge weights and density 
of the aligning subgraphs. So, as long as the extension process 
continues, the degree ratio increases. However, it tends to reach 
local maxima, so the goal of the refinement phase is to try to shift 
from these local maxima, in order to reach a better approximation 
of the global maximum. 

Final alignments ranking. Once the algorithm completes 
die extraction of conserved subgraphs, GASOLINE ranks all the 
alignments through a score called Index of Structural Conservation 
(ISC) which measures its quality in terms of topology and sequence 
similarity. Let SG be the current subgraph alignment and let W be 
the number of nodes in each aligned subnetwork. SG can be 
represented as a matrix with N rows and W columns, where the i- 
th row stores proteins of the aligned subgraph SGj. The structural 
similarity score between two aligned subgraphs, P and Q (i.e. two 
rows of the above matrix), measures the similarity between the 
topology vectors of the corresponding proteins in the current 
mapping. Let x and y be two nodes and and Vy tiieir topology 



vectors. cInterACTIONS(x,j) denotes the percentage of entries in 
Vx and Vy that are either both null or both different from zero 
(consisting of conserved links in both species): 

cInteractions(x,>') = 



\{\<i<W:( VM ^0 A Vy{{\ ^Q) V =0 A F,,^ =0)}| 



The pairwise structural similarity score PairSim between P and 
Q is given by: 



PAIRSlM(P,g) = ^ ClNTERACTIONS(P[/],g[/]) 
/=! 

where P[i] and Q[i\ are the matched nodes in P and Q 
respectively. The structural similarity score of alignment A, 




GASOLINE 
SMETANA 



2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500 8000 
Size of ancestor network (Na) 



Figure 5. Running times of GASOLINE and SMETANA for different number of nodes (A'^) of the ancestor networic. 

doi:1 0.1 371/journal.pone.0098750.g005 
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Table 5. Features of 6 blolocial 


eukaryotic networks obtained from STRING. 






SPECIES 


# PROTEINS 


# PPIs 


Caenorhabditis elengans 


6173 


26184 


Drosophila melanogaster 


8624 


39466 


Homo sapiens 


12575 


86890 


Mus musculus 


9781 


52161 


Rattus norvegicus 


8763 


39932 


Saccharomyces cerevisiae 


6136 


166229 





doi:1 0.1 371 /journal.pone.0098750.t005 



StruCtSim(^), can be defined as the sum-of-pair of all pairwise 
structural similarity scores: 



StructSim(^) = 



N N 

^^PAIRSlM(^,-,^y) 

,=1 j=l 



According to this definition the maximum StrUCtSim value is 
N X IV, achieved by iV perfectly aligned cliques. Finally, the ISC 
of an alignment A can be defined as the normalization of 
StruCtSim(^) in the [0,1] interval: 



ISC(^) = 



StructSim(^) 
NxW 



Postprocessing. The final set of local alignments returned by 
GASOLINE is post-processed to filter out highly overlapping 
complexes. Alignments are sorted according to their size and ISC 
score. Let S0 = {S0^^,SG2,---,SG'ff} the local alignment of rank i 
in the sorted hst. For each subnetwork of the ahgnment SC , 
Perc(5'G^) denotes the percentage of proteins in SGJ. observed in 
the previous i—l alignments. Let Perc(5'G') the average value of 
Perc(5'Gj,) across all the networks. If Perc(5'G') is above a given 
threshold the alignment is discarded. 

Results and Discussion 

We performed a set of experiments on synthetic and real 
biological networks to asses the performance of GASOLINE. AU 
tests have been performed in a Intel Core 17-2670 2.2Ghz CPU 
with a RAM of 8 GB. 

Data Description and Experimental Setup 

Synthetic Biological networks were generated using NAPA- 
bench [28], a large-scale network alignment benchmark for 
generating families of evolutionary related synthetic PPI networks, 
evolved from a common ancestor, according to a given 
phylogenetic tree. It has been recently used as a framework to 
compare the accuracy and the scalability of different alignment 
algorithms [23,28]. 

Real Biological networks were taken from STRING (version 
9.0) [8], a database of known and predicted PPIs, collected from 
different high-throughput experiments, coexpression data and 
publications. For every examined species, we filtered the set of 
interactions, considering only experimentally supported interac- 
tions (i.e. those with positive values on "Experimental" field). We 



point out that this kind of protein interactions can also result from 
experimental knowledge transferred from one species to another. 
Three different case studies have been examined: 

1 . Pairwise and multiple alignments of synthetic networks; 

2. 6-way alignment of real PPI eukaryotic networks; 

3. 25-way alignment of real PPI vertebrata networks. 

In case studies a) and b) we compared our method against three 
different global and local multiple network alignment algorithms: 
SMETANA [23], IsoRankN [22] and NetworkBLAST-M [19]. 
To our knowledge, the first two methods are the best global many- 
to-many aligners of two or more species, while NetworkBlast-M 
represents the state-of-the-art for the local alignment problem. We 
chose both global and local alignment methods in order to (i) 
highlight the ability of GASOLINE to correctly map many 
proteins of different species as a good global aligner does; (ii) find 
many conserved complexes as a good local aligner does. In our 
experiments we ran IsoRankN with a = 0.7 and .^^=10 and we 
used the restricted-order version of NetworkBLAST-M for 
computational reasons. To compute similarities between proteins, 
we used Blast bit scores for GASOLINE, SMETANA and 
IsoRankN, and Blast E-values for NetworkBLAST-M. 

We used several measures to evaluate the specificity, the 
sensitivity and the functional consistency of the alignment 
algorithms, both for synthetic and for real biological networks, 
following the methodology described in [28]. We also tested the 
robustness of the analyzed methods in case of low sequence 
similarity between homologous proteins and the scalability with 
respect to the number and the size of aligned networks. 

In case study c) we tested the ability of GASOLINE to find 
highly conserved complexes across many species in reasonable 
time. Starting from the information about orthologous groups 
(COG, KOG and NOG) obtained from the STRING database, 
we computed the Jaccard similarity coefficient [30] between the 
sets of two proteins' orthology groups. That is defined as the 
number of common groups divided by the cardinality of the union 
of the two sets. 

The algorithm needs a few parameters to be set out: 

• IterSeed: number of iterations of Gibbs sampling in the 
bootstrap phase; 

• IterExtend: number of iterations of Gibbs sampling in the 
extension step; 

• IterPhase: number of iterations of each iterative phase; 

• a: threshold value for the degree of candidate seed nodes; 

• Overlap: threshold value for overlap percentage; 

• Minimum complex size(MCS): minimum number of proteins 
of a conserved complex; 
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Table 8. Best 10 complexes found by GASOLINE. 





RANK 


DESCR 


SIZE 


ISC 


GOs 


NetBlast 
RANK 


1 


Large and small subunit 


59 


85.6% 


16 


10, 12 




of ribosomes in the cytosol 








14, 15 


2 


Spllceosome 


40 


87.1% 


13 


5, 9 


3 


Proteasome 


32 


95% 


17 


2, 3 


4 


Ribosome biogenesis 


25 


89.2% 


11 


4, 16 


in the nucleolus 


5 


Protein serine/threonine 


25 


75.6% 


19 


34, 35 


kinase activity 


6 


DNA repair complex 


24 


92.5% 


39 


18 


7 


SSU processome 


22 


96.4% 


4 


1 


8 


DNA directed 


21 


94.2% 


13 


6, 7 


RNA polymerase 


9 


Vesicle-mediated transport 


20 


85.5% 


20 


19 


10 


Prefoldin complex 


19 


90.6% 


2 


37 



doi:10.1371/journal.pone.0098750.t008 



Some of these parameters were established experimentally: 

• IterSeed = 2QQ; 

• IterExtend = 200; 

• Overlap = 0.5; 

• IterPhase =10; 

Notice that, some parameters are strictiy related to the 
stochastic nature of the algorithm (i.e. IterSeed, IterExtend, 
IterPhase). Such parameters have been determined in connection 
to the convergence of the algorithm on the network instances 
tested. Therefore, we suggest these default parameters since are 
enough to yield good alignment results. 

The threshold parameter (cr) for the seed selection represents a 
tradeoff between speed and accuracy of our method. In order to 
maximize the accuracy and the coverage of GASOLINE, its value 
has been set to 1 in all comparisons. This means that no filtering 
on the nodes has been applied for the networks alignment. 
However, we give the possibility to the users to increase the value 
of such parameter for large input instance, as we did in third case 
study for the 25-way alignment. Concerning the Overlap 
parameter, it allows to filter the output produced by the algorithm. 
We chose an intermediate value (0.5) for this parameter. However, 
the user can vary this parameter to tune the number of subgraphs 
alignments that GASOLINE gives as output. 



As regards the MCS parameter, this essentially allows to set the 
smallest size of subgraphs alignments. In our experiments, we set 
this parameter to the minimum value (1), in order to maximize 
protein coverage, since we are comparing our method with global 
alignment algorithms too. Unlike the threshold parameter for the 
seed selection, it does not affect the running time of GASOLINE, 
since it concerns the postprocessing phase. 

Case study 1: alignment of synthetic networks 

We first assessed the performance of the proposed method on 
different datasets of synthetic similar PPI networks generated with 
NAPAbench [28]. We considered three different partitions of 
datasets. Each partition consists of three families of aligning 
networks, generated using three different network growth models, 
i.e. duplication-mutation-complementation model (DMC) [31], duplication 
with random mutations model (DMR) [32,33] and crystal growth model 
(CG) [34]. From now on, we will denote them as DMC, DMR and 
CG families. We set qcon=0.\ and qnmd = OA?, for DMC, 
q„ew = 0.2 and qdel = 0.5 for DMR and <5 = 4 for CG. 

The first partition is formed by families of 2 closely related 
networks, evolved from a common ancestor with Na = 5000 nodes. 
The families of the second partition consists of 4 evolutionary 
distant networks, with a common ancestor of A^a = 4000 nodes. In 
the last partition, each family contains 8 networks with different 



Table 9. GO enriched categories related to the Proteasome complex. 





GO category 


GASOLINE 


NetworkBlast-M 


GO:0000502 


5.551 E-1 7 


3.775E-16 


GO:0005839 


3.701 E-1 7 


l.llOE-16 


GO:0019773 


1.199E-15 


8.882 E-1 7 


GO:0051603 


1.480E-16 


2.405E-16 


GO:0004298 


5.551 E-1 7 


9.252E-17 



doi:l 0.1 371 /journal.pone.0098750.t009 
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Table 10. Running times of GASOLINE, SMETANA, NetworkBlast-M and IsoRank-N. 



Alignment 


GASOLINE 


SMETANA 


NetworkBlast-M 


IsoRank-N 


W-Y 


154 sec 


125 sec 


59 sec 


54460 sec 


H-M 


890 sec 


1587 sec 


205 sec 


16620 sec 


W-F-Y 


1 75 sec 


351 sec 


281 sec 


148320 sec 


W-H-M-Y 


409 sec 


6310 sec 


4854 sec 


>2 days 


W-F-H-M-Y 


533 sec 


13380 sec 


5999 sec 


>2 days 


All networks 


666 sec 


22185 sec 


12487 sec 


>2 days 



doi:1 0.1 371 /journal.pone.0098750.t01 0 



evolutionary distances, generated from a common ancestor of 
Ar„ = 3000 nodes. 

Figure 3 depicts the pliylogenetic trees used for the families of 
each partition. AH the branches of the phylogenetic tree have 
weight 500, meaning that each node of the tree (except the root) is 
a network obtained from the parent node by adding 500 nodes 
according to the growth model used. 

To measure the overall accuracy of the proposed methods, we 
used functional groups associated by NAPAbench to each protein 
of the aligning networks. We call equivalence class a set of proteins 
of different species (one or more for each network), which are 
mapped together by a given algorithm. An equivalence class is 



claimed as correct if all the included nodes belong to the same 
functional group. For each method we computed three different 
quality measures: 

• Specificity (SPE): the relative number of correct equivalence 
classes; 

• Coned nodes (CN): the total number of proteins assigned to 
correct equivalence classes; 

• Mean normalized entropy (MNE): the mean normalized entropy of 
the predicted equivalence classes. Given an equivalence class 
C, the normalized entropy of C is computed by: 



Table 11. Features of 25 biological eukaryotic networks downloaded from STRING. 



SPECIES 


# PROTEINS 


# PPIs 


Anolis carolinensis 


6510 


31135 


Bos taurus 


8474 


42234 


Canis familiaris 


8440 


42239 


Cavia porcellus 


8185 


42208 


Danio rerio 


5720 


25732 


Dasypus novemcinctus 


6850 


30495 


Equus caballus 


8144 


40703 


Felis catus 


7200 


32547 


Gallus gallus 


6409 


29534 


Gasterosteus aculeatus 


6018 


28276 


Homo sapiens 


12575 


86890 


Macaca mulatta 


8787 


41460 


Monodelphis domestica 


7800 


38002 


Mus musculus 


9781 


52161 


Ornithorhynchus anatinus 


6035 


26467 


Oryctolagus cuniculus 


8010 


39304 


Oryzias latipes 


5754 


26880 


Pan troglodytes 


8677 


44263 


Pongo pygmaeus 


8551 


43984 


Rattus norvegicus 


8763 


39932 


Sus scrofa 


6752 


29852 


Taenlopygia guttata 


6271 


28791 


Takifugu rubipres 


5872 


27077 


Tetraodon nigroviridis 


5779 


25730 


Xenopus tropicalis 


6153 


29769 



doi:1 0.1 371 /journal.pone.0098750.t01 1 



PLOS ONE I www.plosone.org 11 June 2014 | Volume 9 | Issue 6 | e98750 



Greedy and Stochastic Local Network Alignment 
Table 12. Best 10 conserved complexes found by GASOLINE for the alignment of 25 vertebrate PPI networks. 



RANK 


DESCR 


SIZE 


ISC 


GOs 


1 


Protein serine/threonine kinase activity complex 


26 


86.1% 


19 


2 


Proteasome 


20 


91.3% 


14 


3 


Nuclear receptor DNA complex 


16 


78.7% 


13 


4 


Histone deacetylase complex 


14 


85.4% 


14 


5 


Vesicle-mediated transport 


13 


86.5% 


10 


6 


Cyclin-dependent kinase complex 


13 


85.9% 


8 


7 


Chaperonin-containing T-complex 


13 


85.5% 


8 


8 


DNA directed RNA polymerase II 


12 


94.3% 


8 


9 


Eukaryotic translation initiation factor 3 


12 


91.8% 


5 


10 


Spliceosome 


11 


92.6% 


5 



doi:1 0.1 371 /journal.pone.0098750.t01 2 



mcy- 



logd 



where p, is the fraction of proteins in C that belong to the i-th 
functional group and d is the number of different functional 
groups. 

CN reflects the sensitivity of the method, while MNE measures 
the consistency of the predicted alignments. For SMETANA and 
IsoRankN we considered only equivalence classes that contain at 
least one node from each species. 

Tables 1 , 2 and 3 summarize the values of SPE, CN and MNE 
of the proposed methods for all the alignments of 2, 4 and 8 
networks, respectively. Each table reports the results obtained for 
DMC, DMR and CG families. In aU cases SMETANA has the 
highest sensitivity, recovering a high number of CN. However, our 



method resulted more precise, especially in the 8-way alignment, 
resulting in a higher specificity and a lower rate of false positives. 
The lower sensitivity of GASOLINE is due to the fact that our 
method is based on 1 -to- 1 mapping, while SMETANA performs a 
many-to-many alignment. The other two methods generally 
exhibit lower specificity, sensitivity and consistency than SME- 
TANA and GASOLINE. Interestingly, the specificity of GASO- 
LINE remains very high (around 90%), even though the number 
of networks increases, while the accuracy of all the other 
algorithms tends to decrease. In particular, the accuracy of 
NetworkBLAST-M falls down from parrwise to 8-way alignment, 
going from 88% to 4%. 

Table 4 compares the running times of the four algorithms for 
each of the nine network families considered. In the pairwise case, 
NetworkBLAST-M and SMETANA are the fastest methods, while 
in the multiple case GASOLINE shows die best performances. 




(a) (b) 

Figure 6. Meta-graph of complexes found by GASOLINE for the alignment of 25 PPI vertebrata networks, (a) Chaperonin complex, (b) 
Proteasome complex. Cyan indicates low conservation, green medium, yellow high and red very high. 
doi:1 0.1 371/journal.pone.0098750.g006 
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Surprisingly, for all DMR families SMETANA performed better 
than GASOLINE, even in the multiple case. This is probably due 
to the fact that networks in DMR families are sparser than the 
others and GASOLINE usually works better with denser networks. 
This hypotheses seems to be supported by the tests performed on 
the real biological networks, which are two or three times denser 
than the synthetic ones. 

Next, we investigated the effects of sequence similarities on the 
performances of the algorithms. Following the approach used in 
[23,28], we introduced a bias term b on the similarity score 
distribution of potential orthologs between different networks, in 
order to increase the differences between the similarity scores of 
orthologous nodes and those of non-ortologous nodes. We 
generated 6 different families of aligning networks, by varying b 
between —150 and 250. Negative values oi b penalize sequence 
similarity scores, while positive values of b enhance them, making 
the alignment easier to compute. AU families consist of 4 networks 
generated with CG model, using the phylogenetic tree of 
Figure 3b. 

Figure 4 reports the values of SPE and CN for different values of 
b. GASOLINE shows the most constant level of accuracy among 
all the methods, even for negative values oib. This means that our 
algorithm exploits topological informations well and it can 
produce many correct alignments even when sequence similarity 
scores are very noisy (73% of SPE, when b = —150). Similarly, 
SMETANA shows a constant level of accuracy, but its specificity is 
always below that of GASOLINE for non positive values of b. 
Surprisingly, for the lowest value of i (b= —150), our method 
recovers more correct nodes than SMETANA. On the other 
hand, NetworkBLAST-M and IsoRankN take great advantage 
from the increasing bias with respect to both SPE and CN values, 
so they seem to strongly rely on sequence similarity scores during 
the computation of the alignments. 

Finally, we tested the scalability of our method, based on the 
size of aligning networks. We generated 7 different families, by 
varying the number of nodes of the ancestral network, Na from 
2000 and 5000. Again, all famihes consist of 4 networks generated 
with CG model, using the phylogenetic tree of Figure 3 (b). We 
performed a comparison between GASOLINE and SMETANA, 
which are clearly the fastest methods, as shown before. Figure 5 
shows the running time for different values of Na . As can be seen, 
GASOLINE is always faster than SMETANA and generally shows 
less variance in running times. 

Case study 2: alignment of 6 PPI eukaryotic networks 

In the second case study, we compared the four algorithms on 
real biological networks of 6 species (yeast, worm, fly, human, 
mouse and rat). Table 5 describes the features of the networks. Bit 
scores and BLAST E-values between all pairs of proteins 
belonging to different networks were computed. All pairs with 
E- value greater than 10^' were filtered out. In order to compare 
the consistency and the accuracy of the algorithms, we used 
orthologous groups (COG, KOG and NOG), downloaded from 
STRING [8]. As in the previous case study, we define an 
equivalence class as a set of proteins of difierent species which are 
mapped together by a given algorithm. An equivalence class is 
claimed as correct if all the included nodes share at least one 
orthologous group. 

To asses the performance of the algorithms, we computed 
specificity (SPE) and number of correct nodes (CN), and we 
replaced the mean normalized entropy with a different measure, 
the mean group consistency (MGC), defined as follows: 



GriC) 



where C is the set of all predicted equivalence classes, 
CommonGr(C) is the set of groups shared by every protein in C 
and Gr{C) is the set of groups associated to at least one protein in 
C. 

We decided to change the consistency measure because a 
protein of a real biological network may be associated to more 
than one groups, whUe in the previous case study a protein was 
always associated to at most one group, assigned by NAPAbench 
during the generation of synthetic networks. 

Table 6 reports the quality measures for GASOLINE, 
SMETANA, NetworkBlast-M and IsoRank-N in the case of 
pairwise and 3-way alignment. In human-mouse alignment, 
IsoRank-N unexpectedly failed and did not recover any conserved 
group. Results show that GASOLINE has much higher SPE and 
MGC than the compared algorithms, especially in the 3-way 
alignment case. Moreover, the number of correct nodes found by 
GASOLINE are now comparable to those of SMETANA, or even 
higher. Low values of CN in NetworkBlast-M are probably due to 
the high threshold for the minimum size of complexes (which is 5). 
Furthermore, NetworkBlast-M exhibit lower values for all 
considered metrics than GASOLINE in all tested cases. 

Such results are confirmed for the alignment of 4, 5 and 6 
species (Table 7). It is worth noting that the specificity of 
GASOLINE remains very high (around 95%) and the differences 
between GASOLINE and the other methods increase (around 
20% specificity more than the second best algorithm, SME- 
TANA). In this case, quality measures are not reported for 
IsoRank-N because of its high running time (more than 2 days of 
computation). 

To sum up, the performance results of GASOLINE in the 
context of real biological networks are superior to those of 
synthetic networks, with respect to both specificity and number of 
correct nodes, which is related to the sensitivity of the algorithm. 
Moreover, the values of CN are very close to or even higher than 
SMETANA, though the latter is a global alignment method. 

A further comparison between GASOLINE and NetworkBlast- 
M was made to assess the statistical and biological significance of 
complexes found by both methods in the alignment of 6 species. 
We annotated aligned proteins with GO terms (cellular compo- 
nents, processes and functions), taken from BioDBNet [35]. We 
computed, for every GO category in each complex of the 
alignments, a /)-value based on the hypergeometric distribution. 
Finally, /(-values have been corrected by applying FDR correction 
for multiple hypotheses testing, with a = 0.01. 

Table 8 shows the 10 best complexes identified by GASOLINE, 
sorted by their size and ISC score. The number of enriched GO 
categories together with the ranking of the corresponding 
complexes found by NetworkBlast-M are reported. The table 
shows that the best results found by GASOLINE are also among 
the best results identified by NetworkBlast-M. 

GASOLINE found more complexes than NetworkBlast-M (46 
vs 45). However, most of the results are common to both methods. 
Nine small complexes (5-7 proteins) have been identified only by 
GASOLINE and eight small complexes (5-10 proteins) have been 
recovered only by NetworkBlast-M. 

Some of the complexes are correctly split by GASOLINE and 
wrongly joined in NetworkBlast-M, while other complexes in 
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GASOLINE are actually smaller than the corresponding ones in 
NetworkBlast-M. This is probably due to the different scoring 
functions used by the two methods. 

All complexes returned by NetworkBlast-M can include non 
1-to-l mapping between proteins of different networks. However, 
these have a frxed maximum size of 15 proteins. This is a serious 
limitation in the context of local alignment of biological networks 
since real biological complexes can be actually bigger [36] . Table 9 
shows that the most significant GO categories found by 
GASOLINE and NetworkBlast-M for the Proteasome complex 
have similar significant />-values. Nevertheless, the Proteasome 
complex found by GASOLINE includes more proteins than the 
one found by NetworkBlast-M (32 vs 15 proteins). 

In Table 10 we report the running times of GASOLINE, 
SMETANA, NetworkBlast-M and IsoRank-N. In the case of 
pairwise and 3-way alignment, NetworkBlast-M is faster than 
GASOLINE. However, GASOLINE clearly outperforms Net- 
workBlast-M and the other algorithms in the multiple case scaling 
well with the number of networks. 

Case study 3: alignment of 25 vertebrata PPI networks 

In the last case study, we collected a dataset of 25 vertebrata 
biological networks. Table 1 1 describees the features of these 
networks. We ran GASOLINE with higher values of MCS and a 
(MCS = 5, (7 = 7), for computational reasons due to the high 
number of aligned networks. We found 36 complexes conserved in 
all species. Table 12 lists the 10 highest-scored ones, together with 
the number of significantly enriched GO categories. 

Most of the complexes found by GASOLINE in the second case 
study are also present in this third one. However they are smaller 
here (i.e. sphceosome), due to (i) the higher number of aligned 
networks; (ii) incompleteness of PPI networks data in some species. 
GASOLINE took 2250 seconds (~38 minutes) to perform the 
alignment of aU 25 vertebrata PPI networks. 

We also analyzed phylogenetic relations among corresponding 
proteins of distant species in local alignments. Largest and most 
conserved complexes returned by GASOLINE, the proteasome 
and the chaperonin, were considered. We represented the 
conserved cluster of interactions as a single meta-graph 
(Figure 6), where nodes are classes of aligned proteins (one for 
each species) and edges are colored according to the conservation 
extent of the corresponding interaction. 
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